Policy Optimization as Online Learning with Mediator Feedback
Abstract
Policy Optimization (PO) is a widely used approach to address continuous control tasks. In this paper, we introduce the notion of mediator feedback, which frames PO as an online learning problem over the policy space. The additional information available, compared to standard bandit feedback, allows reusing samples generated by one policy to estimate the performance of other policies. Based on this observation, we propose an algorithm, RANDomized-exploration policy Optimization via Multiple Importance Sampling with Truncation (RANDOMIST), for regret minimization in PO, which employs a randomized exploration strategy, differently from the existing optimistic approaches. When the policy space is finite, we show that, under certain circumstances, it is possible to achieve constant regret, while always enjoying logarithmic regret. We also derive problem-dependent regret lower bounds. Then, we extend RANDOMIST to compact policy spaces. Finally, we provide numerical simulations on finite and compact policy spaces, in comparison with baselines.
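The sample-reuse idea in the abstract can be illustrated with a minimal sketch of a generic truncated importance-sampling estimator. This is not the RANDOMIST estimator from the paper. The one-dimensional Gaussian policies, the toy_return function, the truncation threshold, and all names below are assumptions made for illustration only.

```python
import math
import random


def gaussian_pdf(x, mean, std=1.0):
    """Density of a normal distribution N(mean, std^2) at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def truncated_is_estimate(samples, returns, behavior_mean, target_mean, threshold):
    """Estimate the target policy's expected return from behavior-policy samples.

    Each importance weight p_target / p_behavior is clipped at `threshold`,
    which trades a small bias for bounded weights.
    """
    total = 0.0
    for x, r in zip(samples, returns):
        weight = gaussian_pdf(x, target_mean) / gaussian_pdf(x, behavior_mean)
        total += min(weight, threshold) * r
    return total / len(samples)


def toy_return(x):
    """Hypothetical return: highest when the action is close to 1."""
    return -(x - 1.0) ** 2


# Samples are drawn once, from the behavior policy only (assumed N(0, 1)).
rng = random.Random(0)
behavior_mean, target_mean = 0.0, 0.5
samples = [rng.gauss(behavior_mean, 1.0) for _ in range(20000)]
returns = [toy_return(x) for x in samples]

# Reuse those samples to evaluate a different policy (assumed N(0.5, 1)).
estimate = truncated_is_estimate(
    samples, returns, behavior_mean, target_mean, threshold=10.0
)

# Closed form for this toy problem: E[-(x - 1)^2] under N(0.5, 1)
# equals -(variance + (mean - 1)^2) = -(1 + 0.25).
exact = -(1.0 + (target_mean - 1.0) ** 2)
print(f"estimate={estimate:.3f}, exact={exact:.3f}")
```

In this sketch no sample is ever drawn from the target policy, yet its expected return can still be estimated. That is the kind of information reuse that, according to the abstract, distinguishes mediator feedback from standard bandit feedback.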
Similar resources
Online Learning with Partial Feedback
In previous lectures we talked about the general framework of online convex optimization and derived an algorithm for prediction with expert advice from this general framework. To apply the online algorithm, we need to know the gradient of the loss function at the end of each round. In the prediction of expert advice setting, this boils down to knowing the cost of each individual expert. In thi...
Online Learning with Preference Feedback
We propose a new online learning model for learning with preference feedback. The model is especially suited for applications like web search and recommender systems, where preference data is readily available from implicit user feedback (e.g. clicks). In particular, at each time step a potentially structured object (e.g. a ranking) is presented to the user in response to a context (e.g. query)...
Online Learning with Feedback Graphs: Beyond Bandits
We study a general class of online learning problems where the feedback is specified by a graph. This class includes online prediction with expert advice and the multiarmed bandit problem, but also several learning problems where the online player does not necessarily observe his own loss. We analyze how the structure of the feedback graph controls the inherent difficulty of the induced T -roun...
Online Learning under Delayed Feedback
Online learning with delayed feedback has received increasing attention recently due to its several applications in distributed, web-based learning problems. In this paper we provide a systematic study of the topic, and analyze the effect of delay on the regret of online learning algorithms. Somewhat surprisingly, it turns out that delay increases the regret in a multiplicative way in adversari...
Model-Free Imitation Learning with Policy Optimization
In imitation learning, an agent learns how to behave in an environment with an unknown cost function by mimicking expert demonstrations. Existing imitation learning algorithms typically involve solving a sequence of planning or reinforcement learning problems. Such algorithms are therefore not directly applicable to large, high-dimensional environments, and their performance can significantly d...
Journal
Journal title: Proceedings of the ... AAAI Conference on Artificial Intelligence
Year: 2021
ISSN: 2159-5399, 2374-3468
DOI: https://doi.org/10.1609/aaai.v35i10.17083